Abstract - A key challenge in client-assisted content delivery 
is determining how to allocate limited server bandwidth across 
a large number of files being concurrently served so as to 
optimize global performance and cost objectives. In this paper, 
we present a comprehensive experimental evaluation of strategies 
to control server bandwidth allocation. As part of this effort, 
we introduce a new model-based control approach that relies 
on an accurate yet concise "cheat sheet" based on a priori 
offline measurement to predict swarm performance as a function 
of the server bandwidth and other swarm parameters. Our 
evaluation using a prototype system, SwarmServer, instantiating 
static, dynamic, and model-based controllers shows that static 
and dynamic controllers can both be suboptimal, for different 
reasons. In comparison, a model-based approach consistently 
outperforms both static and dynamic approaches provided it 
has access to detailed measurements in the regime of interest. 
Nevertheless, the broad applicability of a model-based approach 
may be limited in practice because of the overhead of developing 
and maintaining a comprehensive measurement-based model of 
swarm performance in each regime of interest. 

I. Introduction 

Faced with the challenge of ever-increasing demand for 
content, content distributors have turned to client-assisted 
content delivery in recent times. A client-assisted content 
delivery architecture enables content distributors to provide 
performance in a scalable and cost-effective manner by oppor- 
tunistically leveraging client resources, especially their uplink 
bandwidth, to augment their managed infrastructure resources. 
Although client-assisted content delivery systems have their 
roots in peer-to-peer file sharing systems [1], [2], commercial 
CDNs such as Akamai, Velocix, and Octoshape [3], [4], as 
well as live streaming services such as PPLive and Sopcast [6], 
Q have warmed up to using them for mainstream enterprise 
content delivery in recent times. 

A key problem in client-assisted content delivery is band- 
width management, i.e., determining how to allocate limited 
server bandwidth across a large number of files being concur- 
rently served to clients so as to balance the performance and 
cost objectives of the content distributor. Unlike purely client- 
server systems or purely peer-to-peer systems, this problem 
is particular to client-assisted content delivery systems that 
attempt to combine the predictable performance and ease 
of management of the former with the scalability and cost- 
effectiveness of the latter. The server bandwidth allocated 
to a swarm, or a set of clients concurrently downloading 
the same file, is critical in determining the effectiveness of 
client-to-client exchanges and by consequence client-perceived 
performance. Furthermore, the appropriate allocation may be 
counter-intuitive, e.g., a popular file requires less server 
bandwidth compared to an unpopular file, all else being equal, in 
order to ensure similar client-perceived performance. 

Our primary contribution is a measurement-driven com- 
parative analysis of several existing and new strategies for 
allocating server bandwidth in client-assisted content delivery 
systems. To this end, we classify these bandwidth allocation 
strategies, or controllers, into three categories. The first is 
static, a class of controllers that use simplistic strategies such 
as allocating bandwidth uniformly, on a best-effort basis, or 
proportional to the demand across files [1]. The second is 
dynamic, a class of controllers that constantly adjust the alloca- 
tion in response to fine-grained client-perceived performance 
so as to optimize the performance or cost objectives of the 
content distributor E), 10. 

In this paper, we present a third, new class of controllers 
called model-based controllers that allocate server bandwidth 
based on a predictive model of client-perceived performance as 
a function of the server bandwidth and other swarm parameters 
such as the request arrival rate, file size, and client upload 
capacities. Unlike dynamic controllers that can be suboptimal 
due to long convergence delays while searching for an optimal 
allocation in situ, model-based controllers can jump to the 
optimal allocation in a single step by solving the underlying 
optimization problem "on paper". 

We have implemented a prototype system, SwarmServer, to 
facilitate our comparative analysis of controllers. In addition 
to several simple static and dynamic controllers, Swarm- 
Server supports a model-based controller called CheatSheet 
for three bandwidth allocation objectives: minimizing the av- 
erage download time, maximum download time, or the server 
bandwidth consumed so as to achieve a target performance 
objective. CheatSheet uses extensive a priori measurement 
to develop an accurate and concise model of performance 
as a function of the server bandwidth and a number of 
swarm parameters. To our knowledge, CheatSheet is the first 
attempt at developing a detailed empirical model of swarm 
performance. 

Our extensive experiments with SwarmServer in conjunc- 
tion with BitTorrent swarms running over 350 PlanetLab 
nodes reveal several insights. First, simple static controllers 
are hit-or-miss; while they perform well for some performance 
objectives and workloads, even outperforming dynamic con- 
trollers, they fall severely short on others. The suboptimal 
performance of static controllers is unsurprising and consistent 
with previous findings [8] for one of our three objectives of 
interest. Second, model-based control is feasible and promising: 
CheatSheet consistently outperforms both static and dynamic 
controllers provided its model is based on detailed a priori 
measurements in an environment similar to the operational 
environment. CheatSheet performs up to 4x better than static 
schemes and up to 1.7x better than dynamic controllers. 

Nevertheless, having gone through the experience of building 
a model-based controller, our conclusions about its practicality 
are somewhat mixed, for several reasons. First, building the 
model is hard. To appreciate this, consider that CheatSheet's model 
used in the experiments in this paper alone required over 12 
days of measurement data on PlanetLab so as to account for 
a number of parameters such as the server bandwidth, request 
arrival rate, distribution of client upload capacities, file size, 
etc. Second, while a measurement-driven model is robust to 
small variations in the operational environment, significant 
changes require recalibrating the model. For example, we find 
that the model developed over PlanetLab is inaccurate when 
deployed on a public cloud such as Amazon EC2 or a local 
cluster in our department. Similarly, significant changes in the 
client population or behavior such as participation in multiple 
swarms introduce further uncertainties into the model. Thus, 
model-based control may be appropriate primarily for rela- 
tively predictable environments (e.g., distributing TV shows 
and movies to FIOS [10] customers). 

The rest of the paper quantifies these nuanced pros and cons 
of the three classes of controllers. We begin with a background 
on client-assisted content delivery. 

II. Background 

A client-assisted content delivery system consists of a server 
that acts as the primary source for all content. All clients 
concurrently downloading the same file are referred to as a 
swarm. Clients follow a common peer-to-peer protocol for 
downloading (uploading) the file from (to) other clients in the 
swarm. The server participates by contributing bandwidth to 
all swarms. In this paper, we focus on the BitTorrent protocol 
[1] because of its open nature and wide deployment; however, 
our findings are qualitatively applicable to other comparable 
plugins offered by content distributors Q, 0. 

A key goal of a client-assisted content delivery system is to 
optimize a system-wide objective, e.g., minimize the average 
download time of all clients, by judiciously allocating limited 
server bandwidth across all swarms. To this end, a controller 
at the server collects information from all swarms and uses 
this information to compute and effect an allocation of server 
bandwidth so as to optimize the system-wide objective. 

A. Classification of controllers 

We classify existing controllers as static or dynamic, and 
introduce a new class called model-based controllers. 

Static: A static controller allocates server bandwidth using 
a simple heuristic while being agnostic to the system-wide per- 
formance objective and unresponsive to actual client-perceived 
performance. For example, one static controller that we analyze 
uses BitTorrent as-is, repurposing a common seeder across 
all swarms as the server [1]. 

Dynamic: A dynamic controller continuously monitors fine- 
grained information about client-perceived performance for all 
clients in each swarm (see Figure 1, left), and accordingly 
adjusts the bandwidth allocation in each monitoring epoch. An 
example of a dynamic controller is AntFarm [8], which monitors 
the number of blocks uploaded and downloaded by each client 
in each epoch and uses a strategy based on perturbation and 
gradient ascent to optimize the aggregate download 
rate across all clients across all swarms. 

Model-based: A model-based controller relies on a predic- 
tive model of swarm performance as a function of the supplied 
server bandwidth and other swarm parameters such as the file 
size, the peer arrival rate, and the upload capacity distribution 
of peers. Unlike dynamic controllers, a predictive model ob- 
viates explicit measurement of client-perceived performance, 
requiring only parameters that are already available or easily 
inferred at the server (see Figure 1, right). More importantly, 
it obviates in situ perturbation and gradual adjustment of 
the allocation, enabling the controller to jump to the optimal 
allocation in a single step by using the model to solve the 
underlying optimization problem "on paper". Thus, a model- 
based controller can quickly adapt to sudden changes in 
request arrival rates. 



B. Limitations of dynamic control 

Our motivation for investigating model-based control stems 
from the limitations of dynamic controllers in realistic envi- 
ronments. Unlike static controllers that are but naive baseline 
strategies, the limitations of dynamic control are less obvious 
and are described next. 

Convergence time: Dynamic control works in a feedback- 
driven manner by perturbing the current allocation, monitoring 
the performance impact of the perturbation, and accordingly 
determining the next perturbation. This approach is prone 
to prohibitively long convergence delays, primarily because 
the effect of a perturbed allocation can take several minutes 
to propagate through the swarm so as to be observable by 
the controller. As an example, AntFarm updates its server 
bandwidth once every 300 seconds by 5 KBps, so an adjustment 
of 50 KBps takes ten epochs, or nearly an hour, to take effect. 
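
The arithmetic behind this example can be checked with a short sketch. The function name and parameters below are illustrative assumptions, not part of AntFarm; the sketch only assumes a controller that moves the allocation by a fixed step once per epoch.

```python
import math


def convergence_time_s(adjustment_kbps, step_kbps, epoch_s):
    """Seconds a fixed-step controller needs to shift an allocation.

    Assumes (hypothetically) one step of size step_kbps per epoch and
    no overshoot or back-tracking along the way.
    """
    epochs = math.ceil(abs(adjustment_kbps) / step_kbps)
    return epochs * epoch_s


# Numbers quoted in the text: 5 KBps per 300 s epoch, 50 KBps adjustment.
print(convergence_time_s(50, 5, 300) / 60)  # 50.0 minutes
```

Any back-tracking caused by noisy measurements would only lengthen this estimate.
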

Measurement overhead: Dynamic controllers utilize server 
resources to monitor every client's performance in a swarm; 
this overhead can be significant for a swarming system with 
tens of thousands of clients. 

Measurement error: The performance of any controller in 
steady state depends on how accurately it can estimate the 
relation between server bandwidth and swarm performance. 
Dynamic controllers can inaccurately estimate this relation 
because they measure swarm performance for the current 
bandwidth allocation only for a single measurement interval of 
a few hundred seconds. The statistical variations in the number 
of peers joining the swarm and in their upload capacities 
introduce error in measuring swarm performance. 

These limitations of dynamic controllers compel us to ex- 
plore model-based controllers. We hope that the measurement 
overhead could be relegated to an a priori offline phase to 
develop an accurate model of swarm performance in exchange 
for increased responsiveness in the operational phase. The 
challenge, of course, is to develop an accurate model of swarm 
performance with a tractable measurement overhead and small 
representation size, a challenge we address next. 

Fig. 1. A comparison of dynamic and model-based control architectures. 



III. A Measurement-based Model 

In this section, we develop a measurement-based model of 
swarm performance - the key building block for a model-based 
controller. Unlike prior theoretical models [11], [12], [13] that 
over-simplify swarm behavior, our work, to our knowledge, 
is the first effort at developing a measurement-based model 
of swarm performance. Despite our progress, the proposed 
model falls short, partly because it requires very extensive 
measurements lasting several days, but more fundamentally because 
of the large number of factors that affect swarm performance, 
even with several simplifying assumptions. 

A. Goal and model assumptions 

We start with the following question: what is the average 
download time of peers in a BitTorrent swarm when given 
a certain amount of server bandwidth? The answer to this 
question of course depends on several characteristics of the 
swarm such as the arrival and departure patterns of peers, 
their upload and download capacities, the size of the file 
being distributed, etc. The answer also depends on design 
parameters of BitTorrent clients such as the number of active 
peers to which a peer concurrently uploads and how it splits 
its upload capacity across them, the length of an optimistic 
unchoke round, the size of chunks, etc. Finally, network 
conditions and artifacts of the transport protocol (TCP or 
custom transport protocols such as µTP for non-interfering 
downloads [14]) will also impact swarm performance. Clearly, 
a model attempting to account for all of the factors affecting 
a swarm's performance quickly becomes intractable. 

To derive a simple yet useful model, we consider a swarm 
distributing a file of size $S$ to peers arriving at a rate $\lambda$. 
The upload capacities of arriving peers are drawn from a 
distribution with mean $\mu$. The download capacity of peers 
is unlimited. Peers depart immediately after finishing their 
download (so the departure rate of peers is equal to the arrival 
rate $\lambda$ in steady state). Let $x$ denote the (fixed) bandwidth 
supplied by the server. Our model postulates that the average 
download time of peers, $\tau$, can be determined as a function $f$ 
of $x$, $\mu$, $\lambda$, and $S$. We state this dependence as 

$$\frac{S}{\tau} = f(x, \mu, \lambda, S) \qquad (1)$$

We call $S/\tau$ the swarm performance. As the average download 
time of peers ($\tau$) reduces, swarm performance improves. 

By assuming that $\tau$ is determined by the above four parameters 
alone, the model implicitly makes a few assumptions. The 
model assumes that network loss rates and round-trip times 
are not so high that they reduce the effective average peer 
upload capacity (or equivalently that $\mu$ already incorporates 
these effects). It also implicitly assumes that all peers use a 
standard BitTorrent client and that implementation variations 
across operating systems are minor. It further assumes that $\mu$ 
already incorporates the effect of user-specific configurations 
that limit their upload contribution. Finally, the model assumes 
that despite all these heterogeneous factors affecting the 
distribution of peer upload capacities in practice, this distribution 
is stationary, so the average upload capacity $\mu$ (in conjunction 
with the other three parameters) is sufficient to determine the 
average download time. 

Fig. 2. Dependence of swarm performance on server bandwidth 
($x$) and peer arrival rate $\lambda$ for $S = 10$ MB and $\mu = 100$ KBps. 

B. Measurement-based model 

We take an empirical, measurement-driven approach to 
capture the relationship in Equation (1). A naive approach 
to this end would be to "measure" the relationship posed in 
Equation (1) for all foreseeable values of the four underlying 
dimensions $(x, \mu, \lambda, S)$, which is impractical. Instead, our 
approach is to summarize the relationship using a small number 
of measured scenarios and use simple interpolation to estimate 
the unmeasured scenarios. We begin with a description of our 
measurement setup. 

1) Measurement setup: Our measurement testbed consists 
of 350 PlanetLab nodes installed with an instrumented 
BitTorrent client [15], and two (non-PlanetLab) servers hosted at 
our university that act as the seeder and the tracker respectively 
for all swarms. In each swarm run, peers arrive over time at a 
PlanetLab node to download the file and depart immediately 
after completing the download. Each swarm is run long enough 
so that the average download time of peers stabilizes, and 
the server records the average download time of peers that 
have completed downloads at the end of the experiment. Each 
swarm run is repeated five times with a fixed set of parameters 
$(x, \mu, \lambda, S)$, and different runs vary these parameters. 

We use the upload capacity distribution of BitTorrent peers 
reported in [16], which was scaled and truncated to remove 
very high capacity peers so as to accommodate the daily 
limit on the maximum data transfer imposed on PlanetLab 
nodes. The resulting average upload capacity ($\mu$) is 100 KBps 
with upload bandwidths in the range of 40 to 200 KBps for 
individual peers. No restrictions are imposed on the maximum 
download rate of any client. The file size is fixed at $S = 10$ 
MB. Peer inter-arrival times are exponentially distributed with 
mean $1/\lambda$. 

Figure 2 shows the aggregate results of our measurement 
experiments. Each line corresponds to a fixed arrival rate $\lambda$ as 
shown, and plots the swarm performance for different values 
of the server bandwidth $x$, which is varied from 10 to 100 KBps 
(the upper end is also the average peer upload capacity) in 10 KBps increments. 
With these parameters, a swarm run takes between 2000 and 
5000 seconds, so the total running time to generate this figure 
is over 12 days (5 runs per point × 60 points × roughly an hour 
per run = 300 hours). 

2) Swarm performance vs. server bandwidth: Figure 2 
presents several insights about how the swarm performance 
depends on server bandwidth and peer arrival rate. First, 
swarm performance as expected increases with server band- 
width keeping all else fixed. Second, swarm performance is 
concave with respect to server bandwidth. This is because, 
when the server bandwidth is very low, it becomes the bottle- 
neck preventing peers from efficiently utilizing their upload 
capacity for exchanging blocks. In this regime, increasing 
server bandwidth slightly improves the efficiency of P2P 
exchanges, which improves swarm performance significantly. 
At high values of server bandwidth, there is less room for 
improving the efficiency of P2P exchanges, so the server's 
bandwidth improves performance similar to traditional client- 
server systems, i.e., the bandwidth is divided across extant 
peers. When the server bandwidth equals the average peer 
upload capacity we find that a swarm's utilization of P2P 
bandwidth is about as efficient as it can be, and any additional 
server bandwidth is simply used as in a client-server system. 
As a result, the swarm performance in the regime $x > \mu$ (not 
shown in Figure 2) can be easily derived analytically, obviating 
time-consuming measurements. 

Third, in the regime $x < \mu$ shown in the figure, swarm 
performance improves with the arrival rate (keeping all else 
fixed). At very low arrival rates, e.g., $\lambda = 1/100$ per second, the swarm 
behaves like a client-server system as there is at most one peer 
most of the time, so the corresponding curve resembles the line 
$y = x$. At higher arrival rates, the swarm remains efficient 
(i.e., it maintains a healthy download rate of over 80 KBps) 
for values of $x$ much smaller than $\mu$. This is because large 
swarms are mostly self-sustaining and need only a tiny amount 
of server bandwidth to supply missing blocks in the unlikely 
event that none of the extant peers possess those blocks. 

3) Model representation: To concisely represent the 
swarm-performance model, we carefully select a small number 
of values of each parameter for measurements. We maintain a 
table, referred to as the "cheat sheet", that records the swarm 
performance for all combinations of these parameters. This 
cheat sheet is used to approximately estimate by simple linear 
interpolation the swarm performance for values of parameters 
that are not explicitly measured. Next, we describe how we 
select the values of the model parameters for measurements. 
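
A minimal sketch of such a lookup, assuming the cheat sheet for one file size and one upload capacity distribution is stored as a two-dimensional table indexed by server bandwidth and arrival rate. The function names, the table layout, and the clamping at the grid edges are illustrative assumptions rather than SwarmServer's actual code.

```python
from bisect import bisect_right


def lerp_1d(grid, values, q):
    """Piecewise-linear interpolation over a sorted grid, clamped at both ends."""
    if q <= grid[0]:
        return values[0]
    if q >= grid[-1]:
        return values[-1]
    hi = bisect_right(grid, q)
    lo = hi - 1
    w = (q - grid[lo]) / (grid[hi] - grid[lo])
    return values[lo] + w * (values[hi] - values[lo])


def interpolate_performance(x_grid, lam_grid, table, x, lam):
    """Estimate swarm performance at (x, lam) by bilinear interpolation.

    table[i][j] is the measured performance at (x_grid[i], lam_grid[j]).
    """
    # Interpolate along server bandwidth for each measured arrival rate...
    along_x = [
        lerp_1d(x_grid, [row[j] for row in table], x)
        for j in range(len(lam_grid))
    ]
    # ...then along the arrival rate.
    return lerp_1d(lam_grid, along_x, lam)


# Toy numbers (not measurements from the paper).
x_grid = [10.0, 20.0, 30.0]
lam_grid = [0.1, 0.2]
table = [[20.0, 40.0], [30.0, 50.0], [40.0, 60.0]]
print(interpolate_performance(x_grid, lam_grid, table, 15.0, 0.15))
```

The same one-dimensional helper could be applied once more across file sizes if several cheat sheets are kept.
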

a) Server bandwidth & peer arrival rate: The dependence 
of swarm performance on server bandwidth and peer 
arrival rate for a given upload capacity distribution and file 
size (as in Figure 2) is captured using about 100 values. We take 
measurements for ten values of $x$ ranging from $\mu/10$ to $\mu$, 
and for ten values of $\lambda$ in a range determined by a metric we 
refer to as the "healthy swarm size". The healthy swarm size 
is the number of peers when the efficiency of P2P exchanges 
is maximum. The intuition for healthy swarm size comes from 
Little's law [17]: the healthy swarm size is $\lambda \times S/\mu$, as $S/\mu$ is the 
average download time of peers in this case. When the healthy 
swarm size is one or less, the swarm essentially behaves like 
a client-server system. We empirically observe that when the 
healthy swarm size is 50 or more, the swarm is essentially 
self-sustaining, i.e., even with a server bandwidth of just $\mu/10$, 
the swarm is efficient. So we take measurements for values of 
$\lambda$ selected such that the healthy swarm size $\lambda S/\mu$ increases 
from 1 to 50 in 10 equal increments. The total number of 
combinations of $x$ and $\lambda$ is therefore 100. 
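
The grid selection described above can be sketched as follows. The function name is hypothetical, and the sketch assumes that "10 equal increments" means ten evenly spaced healthy swarm sizes from 1 to 50 inclusive, which is one possible reading of the text.

```python
def cheat_sheet_grid(mu_kbps, file_size_kb, n=10, min_swarm=1.0, max_swarm=50.0):
    """Pick measurement points for server bandwidth x and arrival rate lambda.

    x ranges over n values from mu/n to mu. lambda is chosen so that the
    healthy swarm size lambda * S / mu is evenly spaced between min_swarm
    and max_swarm (an assumed interpretation of the spacing).
    """
    x_values = [mu_kbps * (i + 1) / n for i in range(n)]
    ideal_download_s = file_size_kb / mu_kbps  # S / mu
    step = (max_swarm - min_swarm) / (n - 1)
    swarm_sizes = [min_swarm + i * step for i in range(n)]
    lam_values = [size / ideal_download_s for size in swarm_sizes]
    return x_values, lam_values


# Parameters from the measurement setup: mu = 100 KBps, S = 10 MB (taken as 10000 KB).
xs, lams = cheat_sheet_grid(100, 10000)
print(xs[0], xs[-1])      # bandwidths from mu/10 up to mu
print(lams[0], lams[-1])  # arrival rates in peers per second
```

With these parameters the lowest arrival rate is one peer per 100 seconds, which matches the client-server-like curve discussed earlier.
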
b) File size: We address file size diversity using an 
interpolation approach similar to the one used for arrival rates 
and server bandwidth. A separate cheat sheet is stored for a 
small number of file sizes spanning the regime of interest, e.g., 
10 file sizes in geometric progression from 1 MB to 10 GB. The 
swarm performance for file sizes in between is estimated via 
interpolation. 

At the outset of this work, we expected that a larger file 
size could be treated as equivalent to a larger arrival rate, 
i.e., $f(x, \mu, \lambda, kS)$ could be approximated as $f(x, \mu, k\lambda, S)$, 
thereby obviating the need to maintain separate cheat sheets 
for different file sizes. Our intuition was that $\lambda S$ (bits/sec) 
represents the aggregate demand arriving into the system, 
so the response curve should not change significantly if the 
demand remains unchanged. Unfortunately, this turns out not 
to be the case, as shown by the experiment in Figure 3. The 
figure plots the swarm performance as a function of the server 
bandwidth, and the different lines increase (decrease) $S$ ($\lambda$) 
by the same factor, i.e., $\lambda S$ is the same for all points in the 
graph. The lines clearly show a slight uptrend, suggesting that 
larger file sizes boost swarm performance more than larger 
arrival rates or, equivalently, a swarm distributing a larger file 
performs better than a swarm distributing a smaller file even 
though both have the same aggregate demand, client upload 
capacities, and server bandwidth. 

c) Upload capacity distribution: There are two kinds 
of variations that occur in peer upload capacities. First, the 
upload capacity distribution of any sample of peers currently 
participating in a swarm may differ from the overall dis- 
tribution. Our model implicitly accounts for this statistical 
variation because peer upload capacities during measurements 
are chosen by randomly sampling the distribution. Second, 
the overall upload capacity distribution of peers visiting the 
site can change. However, we expect that upload capacity 
distribution is unlikely to change at short time scales, as it 
depends on technology trends and the population of users who 
visit the site, which is likely to remain stable over the course 
of several months. 

The changes in the upload capacity distribution at time 
scales of several months can be addressed by updating the 
cheat sheet with new measurements. In additional experiments 
(included in our tech report [18] due to lack of space), 
we find that significant changes in the mean or even the 
variance of the upload capacity distribution indeed necessitate a 
new set of measurements. For example, for the same mean 
upload capacity, we find that increasing the variance of upload 
capacities reduces the swarm performance. 

4) Effect of measurement testbed: The measurement-based 
model requires network conditions to remain relatively similar 
to the environment in which the model's measurements were 
obtained. We repeated the experiment shown in Figure 2 on 
two other testbeds - Amazon EC2 [19] and a local cluster. 
For the EC2 experiment, we select an equal number of machines 
from five geographic locations to differentiate the EC2 testbed 
from the local cluster, which has microsecond round-trip 
latencies. In Figure 4, we compare the swarm performance 
on the three testbeds for a peer arrival rate of $\lambda = 1/5$ per second. 
Swarm performance on EC2 is up to 30 KBps higher than on 
PlanetLab. Experiments on the local cluster show even better 
swarm performance than on EC2. 

Swarm performance differs on the three testbeds as their 
effective upload capacities are different. The round-trip times 
in the local cluster are much smaller than in PlanetLab, which 
translates into higher effective upload capacities and 
better performance. EC2 only has a small extent of geographic 
diversity (five different locations), so neighbor relationships 
between peers in the same data center tend to dominate (a 
clustering effect that has also been alluded to by prior work 
[15]). This clustering effect again results in EC2 
nodes having higher effective upload capacities. 

5) Summary and limitations: Although the measurement- 
based model can capture the dependence on four key swarm 
parameters, it still has several limitations. The most critical 
limitation is the extensive measurement needed to build a 
cheat sheet. For a single file size, our measurements take a few 
hundred hours on PlanetLab. A content distributor maintaining 
a few such tables for common file sizes may require a few 
thousand hours of measurements or even more. 

The second limitation is the difficulty in estimating two 
of the model parameters - upload capacity distribution and 
peer arrival rates. Upload capacity distribution is difficult to 
estimate for several reasons - peers may download files from 
multiple swarms simultaneously or otherwise limit their upload 
capacity, and network conditions can significantly change 
the effective upload capacity as shown in Section III-B4. 
Estimating peer arrival rates is challenging primarily because 
users may abort the download before completion and return 
later to resume a download as shown in prior work [20], [21]. 
Therefore, the model also needs to account for peer arrivals 
and departures in the middle of a download. In combination, 
the difficulty of estimating all the model parameters can make 
the measurement-based model ineffective in practice. 

IV. SwarmServer system 

In this section, we present an implemented prototype of our 
system, SwarmServer, to compare different controller strategies. 
We begin with a brief description of our implementation 
and the content distribution objectives that we use for our 
comparison. Then, we discuss the design of model-based, 
dynamic, and static controllers implemented in SwarmServer. 

Implementation: The SwarmServer system is implemented in 
Python and consists of nearly 5000 lines of code. The system 
does not require any modification to the BitTorrent protocol 
for either the peers or the tracker. Our implementation uses 
the instrumented BitTorrent client developed by Legout et al. 
[22], which we modified to enable us to change the maximum 
upload bandwidth of the client without restarting it. 

Content distribution objectives: We compare controller 
strategies on three content distribution objectives. 

• MIN_AVG: Minimize the average download time across 
all peers in all swarms for a given total server bandwidth. 

• MIN_MAX: Minimize the maximum value of the aver- 
age download time across swarms for a given total server 
bandwidth. 

• MIN_COST: Minimize the total server bandwidth while 
achieving a set of specified target download times for 
each swarm. 

A. Model-based controller 

The model-based controller - CheatSheet - allocates server 
bandwidth by solving an optimization problem using the 
measurement-based model developed in Section III. Next, we 
describe the optimization formulations used by CheatSheet 
to calculate bandwidth allocation for each of the objectives 
introduced above. We assume that there are a total of $k$ swarms 
and the average upload capacity, arrival rate, and file size of the 
$i$-th swarm, $1 \le i \le k$, are given by $\mu_i$, $\lambda_i$, $S_i$ respectively. The 
goal is to determine server bandwidth allocations $\{x_i\}_{1 \le i \le k}$ 
so as to optimize the desired objective. 
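
Optimization formulation for MIN_AVG: 

$$\min \sum_{1 \le i \le k} \lambda_i \tau_i \qquad (2)$$

subject to 

$$\tau_i = S_i / f(x_i, \mu_i, \lambda_i, S_i), \quad 1 \le i \le k \qquad (3)$$

$$\sum_{1 \le i \le k} x_i \le X \qquad (4)$$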

The first constraint (3) above simply rephrases Equation (1) 
relating the average download time $\tau$ to the server bandwidth 
$x$ and other swarm parameters. The second constraint above 
limits the total bandwidth the server can allocate to all swarms. 

CheatSheet uses its measured knowledge of $f(\cdot)$ to solve 
this optimization problem. If $f(\cdot)$ is known to be smooth 
and concave in $x$, MIN_AVG can be solved using a greedy 
gradient-ascent strategy that computes a unique, optimal solution 
as follows: (1) Start with $x_1 = x_2 = \cdots = x_k = \Delta$ 
for a small $\Delta$; (2) Allocate the next $\Delta$ units of capacity 
(divided equally) to the swarm(s) with the largest value(s) 
of the gradient of $\lambda_i f(x_i, \mu_i, \lambda_i, S_i)$ with respect to $x_i$; (3) If not all $X$ units of 
capacity have been allocated, go to (2). Else terminate. 
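
The following is a minimal sketch of a greedy allocation in this spirit, not CheatSheet's actual code. It assumes each swarm's response curve is available as a callable `perf_fns[i](x)` (a hypothetical stand-in for the interpolated cheat sheet, with the other parameters fixed), ranks swarms by the finite-difference reduction in the objective (2) rather than by a closed-form gradient, and breaks ties in favor of the first swarm instead of dividing the increment.

```python
import math


def greedy_min_avg(perf_fns, lams, sizes, total_bw, delta=1.0):
    """Greedily allocate total_bw across swarms in increments of delta.

    perf_fns[i](x) returns the predicted swarm performance S_i / tau_i at
    server bandwidth x; lams[i] and sizes[i] are lambda_i and S_i.
    """
    k = len(perf_fns)
    alloc = [delta] * k  # step (1): every swarm starts with delta

    def weighted_time(i, x):
        # lambda_i * tau_i, with tau_i = S_i / f_i(x) as in constraint (3)
        return lams[i] * sizes[i] / perf_fns[i](x)

    # steps (2)-(3): hand out the remaining increments one at a time
    for _ in range(int(total_bw // delta) - k):
        gains = [
            weighted_time(i, alloc[i]) - weighted_time(i, alloc[i] + delta)
            for i in range(k)
        ]
        best = max(range(k), key=lambda i: gains[i])
        alloc[best] += delta
    return alloc


# Toy concave response curves (not measured data).
curves = [math.sqrt, math.sqrt]
print(greedy_min_avg(curves, lams=[1.0, 4.0], sizes=[100.0, 100.0], total_bw=20.0))
```

In this toy run the swarm with the higher arrival rate receives the larger share, since each unit of bandwidth there reduces the download time of more peers.
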

If $f(\cdot)$ is piecewise linear and concave, the above strategy 
still works, but the resulting solution may not be unique. In 
order to compute a unique optimal solution, CheatSheet cleans 
the measured $f(\cdot)$ by fitting smooth and concave polynomial 
curves for each line in Figure 2. In describing the solutions 
to the next two objectives, we assume that this data 
cleaning has already been performed. 

Optimization formulation for MIN_MAX: 

$$\min \left( \max_{1 \le i \le k} \tau_i \right) \qquad (5)$$

subject to the same constraints as (3) and (4) above. 

If $f(\cdot)$ monotonically increases with $x$, MIN_MAX can be 
solved optimally using a simple greedy heuristic. For a target 
rate $y$, let $x = f^{-1}(y, \mu, \lambda, S)$ denote the server bandwidth 
$x$ required to achieve an average download time of $S/y$. The 
heuristic is as follows: (1) Initialize target rate $y = \Delta$ for a 
small $\Delta$; (2) Set $x_i = f^{-1}(y, \mu_i, \lambda_i, S_i)$, $1 \le i \le k$; (3) If 
the bandwidth allocation required to achieve the target is feasible, 
i.e., $\sum_{1 \le i \le k} x_i \le X$, increment target rate $y$ to $y + \Delta$ and go to 
(2). Else, terminate. 
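
A sketch of this heuristic under stated assumptions: each swarm's response curve is a hypothetical increasing callable `perf_fns[i](x)`, the inverse $f^{-1}$ is computed numerically by bisection (the text does not say how it is evaluated), and `x_max` is an assumed upper bound on the bandwidth any single swarm may receive.

```python
def invert_monotone(f, y, lo, hi, tol=1e-6):
    """Smallest x in [lo, hi] with f(x) >= y for increasing f, or None."""
    if f(hi) < y:
        return None  # target rate not reachable within the bound
    if f(lo) >= y:
        return lo
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) >= y:
            hi = mid
        else:
            lo = mid
    return hi


def greedy_min_max(perf_fns, total_bw, x_max, delta=1.0):
    """Raise a common target rate y while the required bandwidth stays feasible.

    Returns the allocation for the last feasible target rate, or None if
    even the first target y = delta is infeasible.
    """
    best = None
    y = delta  # step (1)
    while True:
        needs = [invert_monotone(f, y, 0.0, x_max) for f in perf_fns]  # step (2)
        if any(n is None for n in needs) or sum(needs) > total_bw:
            return best  # step (3): infeasible, so stop
        best = needs
        y += delta


# Toy linear response curves (not measured data).
print(greedy_min_max([lambda x: x, lambda x: 2 * x], total_bw=31.0, x_max=100.0))
```

Because the target rate moves in steps of `delta`, the returned allocation may leave a small amount of the budget unused.
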

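Optimization formulation for MIN_COST: 

$$\min (x_1 + \cdots + x_k) \qquad (6)$$

subject to 

$$\tau_i = S_i / f(x_i, \mu_i, \lambda_i, S_i), \quad 1 \le i \le k \qquad (7)$$

If $f(\cdot)$ is invertible, then MIN_COST can be solved by 
setting $x_i = f^{-1}(S_i/\tau_i, \mu_i, \lambda_i, S_i)$. 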
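
As an illustration, the inversion can be done directly on a strictly increasing piecewise-linear response curve by swapping the roles of the two axes. The representation of a curve as two parallel lists and the function names below are assumptions made for this sketch.

```python
from bisect import bisect_left


def bandwidth_for_rate(x_points, perf_points, target_rate):
    """Invert a strictly increasing piecewise-linear curve.

    x_points[j] is a server bandwidth and perf_points[j] the swarm
    performance at that bandwidth. Returns the bandwidth at which the
    curve reaches target_rate.
    """
    if target_rate <= perf_points[0]:
        return x_points[0]
    if target_rate > perf_points[-1]:
        raise ValueError("target rate not reachable in the measured range")
    hi = bisect_left(perf_points, target_rate)
    lo = hi - 1
    w = (target_rate - perf_points[lo]) / (perf_points[hi] - perf_points[lo])
    return x_points[lo] + w * (x_points[hi] - x_points[lo])


def min_cost_allocation(curves, sizes, target_times):
    """Per-swarm bandwidth that meets each target download time tau_i."""
    return [
        bandwidth_for_rate(xs, perfs, size / tau)  # required rate is S_i / tau_i
        for (xs, perfs), size, tau in zip(curves, sizes, target_times)
    ]


# Toy curve (not measured data): bandwidth in KBps, performance in KBps.
curve = ([10.0, 20.0, 30.0], [40.0, 60.0, 70.0])
print(min_cost_allocation([curve, curve], [10000.0, 6500.0], [200.0, 100.0]))
```

Since each swarm is handled independently, the total cost is simply the sum of the returned values.
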

B. Dynamic controller 

We implement three dynamic controllers: AIAD, Leveler, 
and AntFarm. 

AIAD optimizes the MIN_COST objective and works as
follows. Suppose the target average download time of the
swarm is $\tau$ and the file size is $S$. AIAD initializes the
server bandwidth $x$ to $S/\tau$. Once every epoch, it measures
the average download rate, $y$, of peers in the swarm. If
$S/\tau > y$, it increases the server bandwidth $x$ by $\Delta$. Otherwise
it decreases $x$ by $\Delta$, except in the case that the decrement
would cause $x$ to dip below a minimum bandwidth threshold.
Our implementation sets the epoch length to 200 s, $\Delta$ to 10
KBps, and the minimum bandwidth threshold to 5 KBps.
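One epoch of the AIAD update can be sketched as below. The function name `aiad_step` is hypothetical, units are KB and seconds, and we read "except in the case that the decrement would cause $x$ to dip below" the threshold as holding $x$ unchanged in that case, which is an interpretation of the text rather than a confirmed detail.

```python
def aiad_step(x, measured_rate, file_size, target_time, delta=10.0, x_min=5.0):
    """One AIAD epoch: additive increase if the swarm is slower than the
    target rate S / tau, additive decrease otherwise (never below x_min)."""
    target_rate = file_size / target_time
    if target_rate > measured_rate:
        return x + delta
    if x - delta >= x_min:
        return x - delta
    return x  # decrement would dip below the threshold, so hold (assumed)

# 10 MB file with a 150 s target, i.e., a target rate of about 67 KBps.
x = aiad_step(50.0, measured_rate=80.0, file_size=10000, target_time=150)
```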

Leveler optimizes the MIN_MAX objective. At the start,
Leveler assigns equal bandwidth to all swarms. Once every
epoch, Leveler measures the average download rate of all
swarms. The server bandwidth is increased by a small, fixed $\Delta$
for swarms whose download rate is lower than the median of
average download rates. Similarly, Leveler reduces the server
bandwidth by $\Delta$ for each swarm with average download rate
higher than the median value. Similar to AIAD, Leveler never
reduces the server bandwidth allocated to a swarm below
a minimum threshold. Epoch length, $\Delta$, and the minimum
bandwidth are the same as in AIAD.
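A minimal sketch of one Leveler epoch follows. The name `leveler_step` is hypothetical, and the sketch deliberately omits any enforcement of the total server capacity because the description above does not say how that is handled.

```python
from statistics import median

def leveler_step(alloc, rates, delta=10.0, x_min=5.0):
    """One Leveler epoch: move bandwidth toward swarms below the median
    download rate and away from swarms above it (never below x_min)."""
    med = median(rates)
    new_alloc = []
    for x, r in zip(alloc, rates):
        if r < med:
            x = x + delta
        elif r > med and x - delta >= x_min:
            x = x - delta
        new_alloc.append(x)
    return new_alloc

# Three swarms with equal bandwidth but unequal measured rates (made-up values).
alloc = leveler_step([50.0, 50.0, 50.0], rates=[30.0, 60.0, 90.0])
```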

AntFarm optimizes the MIN_AVG objective and is based on
the algorithm in the AntFarm paper [8]. At the start, AntFarm
allocates a small initial bandwidth to every swarm, and then
assigns the server bandwidth to swarms in small increments
until all server bandwidth is used up.

In steady state, AntFarm computes the bandwidth allocation 
using response curves for each swarm that predict the swarm 
performance as a function of server bandwidth. AntFarm 
measures download rates of peers periodically to obtain a set 
of sample data points of the form (server bandwidth, swarm 
performance). The response curve for a swarm is computed
by fitting a concave, piecewise-linear curve to this set of
data points. Given the response curves for all swarms, their
bandwidth allocation is determined using a gradient-ascent
algorithm similar to the one CheatSheet uses to optimize the
MIN_AVG objective. We refer the reader to our tech report
[18] for a detailed description of our implementation.



C. Static controller 

We implement three static controllers. (1) BitTorrent sets an
upload limit at the server for a set of swarms but does not set
a per-swarm limit. The server bandwidth that each swarm receives
is determined by the number of its peers connected to
the server. (2) EqualSplit splits the available server bandwidth
equally among all swarms. (3) PropSplit splits the total server
bandwidth in proportion to the peer arrival rate of each swarm.
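The two splitting rules amount to the following arithmetic. This is a sketch with hypothetical function names; the BitTorrent controller is not shown because its per-swarm share emerges from peer connections rather than from a formula.

```python
def equal_split(total, k):
    """EqualSplit: every one of the k swarms gets the same share."""
    return [total / k] * k

def prop_split(total, arrival_rates):
    """PropSplit: each swarm's share is proportional to its peer arrival rate."""
    rate_sum = sum(arrival_rates)
    return [total * lam / rate_sum for lam in arrival_rates]

# 200 KBps over 20 swarms, and 100 KBps over two swarms with a 3:1 arrival ratio.
equal = equal_split(200, 20)
prop = prop_split(100, [3, 1])
```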

V. Evaluation 

Our comparison of controller strategies, presented in this
section, answers two main questions: (1) Do dynamic and
model-based controllers improve performance over static
controllers? If yes, then by how much? (2) Which type of
controller, dynamic or model-based, performs better for the
objectives in Section IV? Our experiments show that a
model-based controller outperforms dynamic controllers on all three
objectives we compared. Static controllers cannot equal a
model-based controller either; they perform well on some
workloads and objectives but fare poorly on others.

Experimental setup: We performed our evaluation on 350 
PlanetLab nodes using an instrumented BitTorrent client [22].
Upload and download capacities of peers and peer inter-arrival 
times follow the same distributions as in our measurements. 

Due to limited upload capacities and daily data transfer limits
on PlanetLab, we focus on experiments with small file sizes.
File download time will increase in proportion to the file size
as we cannot increase upload capacity significantly. Thus, an
experiment with a 1 GB file could take 100x longer to finish
than an experiment with a 10 MB file, if the same number
of file downloads occur in both experiments (assuming both
swarms have the same aggregate demand $\lambda S$).

A. Average download time 

First, we compare controllers on the MIN_AVG objective.
We select a workload consisting of 20 swarms whose mean 
arrival rates are chosen according to a Zipf distribution with 
parameter 1.5. The mean arrival rates of the most popular
and least popular swarms are 0.5/s and 0.0055/s,
respectively. Each swarm distributes a file of size 10 MB. The
total server bandwidth is set to 200 KBps. 
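The stated endpoints are consistent with arrival rates that decay as rank raised to the power 1.5. The sketch below reproduces that arithmetic; the function name `zipf_arrival_rates` and the assumption that the rate of the swarm at rank $r$ is exactly $0.5 / r^{1.5}$ are ours, since the text only gives the distribution, its parameter, and the two extreme rates.

```python
def zipf_arrival_rates(k, top_rate, s):
    """Mean arrival rate of the swarm at each popularity rank 1..k,
    assuming rate(rank) = top_rate / rank**s."""
    return [top_rate / rank ** s for rank in range(1, k + 1)]

rates = zipf_arrival_rates(k=20, top_rate=0.5, s=1.5)
least_popular = rates[-1]   # 0.5 / 20**1.5, close to the reported 0.0055/s
```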

Figure 5 shows how the average download time changes
over time for the different compared schemes. The average is
computed using the download time of peers that completed
their download within the previous 2000 sec interval as well
as the resident time, i.e., the time since arrival, for peers whose
downloads are in progress.

There are two main observations from the experiment in
Figure 5. First, in the initial phase, EqualSplit, BitTorrent
and AntFarm incur much higher average download times than
PropSplit and CheatSheet, and their average download times
take considerably longer to stabilize. Second, even after all
controllers have reached steady state, CheatSheet achieves a
download time that continues to be lower (by at least 25%)
compared to all other schemes (that perform roughly similarly
in steady state in this experiment).

Fig. 5. For the MIN_AVG objective, both static
(e.g., EqualSplit) and dynamic controllers (e.g.,
AntFarm) incur a higher download time in the
initial phase as well as in steady state.

Fig. 6. Server bandwidth allocated to swarms by
controllers to minimize the average download time.
Swarms are ordered left to right in decreasing
order of popularity.

Fig. 7. For the MIN_MAX objective, CheatSheet
and EqualSplit perform nearly the same,
while PropSplit performs the worst.

The explanation for these observations is as follows. Equal- 
Split, BitTorrent and AntFarm have a very high download 
time at the start of the experiment because they assign a 
small server bandwidth to large and small swarms alike. If 
the initial server bandwidth is small, a huge number of peers 
build up in highly popular swarms, which is reflected in the 
corresponding download time curves that rise rapidly. For 
example, the download times in EqualSplit increase rapidly 
until about 1500 sec as no peers have departed until then. At 
this point, the download time drops sharply as a result of a 
horde of peer departures that occur when the last block in 
a swarm has been uploaded by the server. In contrast, both
CheatSheet and PropSplit assign higher bandwidth to popular
swarms from the start, so peer departures start much sooner in
popular swarms, considerably reducing their average download
times. We note here that CheatSheet is implemented to
begin with an allocation identical to PropSplit until it has a
stable estimate of peer arrival rates, at which point it switches
to the model-based optimal allocation.

EqualSplit, BitTorrent and AntFarm take considerably 
longer to reach steady state because the number of peers in 
highly popular swarms goes through multiple rounds of ramp- 
ups followed by bulk departures before stabilizing. In this 
experiment, AntFarm takes the longest time to converge to 
a steady state because after assigning 5 KBps to each swarm 
at the beginning, it allocates remaining bandwidth in small 
chunks of 5 KBps once every 200 sec. AntFarm requires many 
such 200 sec epochs in order to build a stable response curve 
for all swarms, resulting in higher download times during this 
convergence phase. We have observed (not shown for brevity) 
that reducing the epoch length does not help and sometimes 
hurts performance as it increases the measurement error in 
learned response curves. 

Why does CheatSheet outperform other schemes even in
steady state? Figure 6 shows the steady-state allocations of
server bandwidth achieved by different schemes, which explain
this observation. Swarms are ordered from left to right in
decreasing order of popularity. We show only the top 10 most
popular swarms for clarity of presentation. CheatSheet uses
the model to predict that the most popular swarm is mostly
self-sustaining and therefore needs only a small bandwidth
to achieve healthy download times. Compared to other
controllers, CheatSheet assigns higher bandwidth values to the
next four popular swarms that belong to a regime where
a small amount of server bandwidth disproportionately
improves performance, which considerably reduces the average
download time. PropSplit and AntFarm by design assign the
most bandwidth to the most popular swarm, but the extra
server bandwidth hardly benefits that swarm. BitTorrent is
biased more towards the popular swarms (as it receives more
peer connections from these swarms compared to singleton
swarms), but its allocation is nevertheless sub-optimal.
EqualSplit clearly makes a sub-optimal decision by allocating equal
bandwidth to all swarms, in light of the above reasons.

B. Min-max average download time 

Next, we compare controllers on the MIN_MAX objective,
i.e., minimizing the average download time of the swarm that
has the worst average download time. We evaluate on the Zipf
workload from the previous subsection and set the total server
bandwidth to 500 KBps in these experiments.

Figure 7 shows the average download time of the swarm
with the maximum average download time (referred to as
MAD time in this discussion). We observe that, even though
the workload is the same, the relative performance of
controllers differs from the experiment in the previous
subsection. The MAD time achieved by PropSplit is twice
that of the other controllers, which differ relatively
little among themselves. Both EqualSplit and CheatSheet
achieve the lowest MAD time. BitTorrent incurs a higher MAD
time in comparison to EqualSplit. The performance of Leveler
varies with time because it changes the server bandwidth 
to each swarm periodically and struggles to converge to a 
steady bandwidth allocation as it shuffles bandwidth across 
20 swarms. The reason (not visible in the figure) is that 
different swarms take different times to manifest the effect 
of the most recent change. Leveler sometimes "panics" and 
allocates more bandwidth to the currently worst swarms too 
quickly and at other times is too slow to move bandwidth 
away from swarms that could do without it. The fluctuating 
performance of Leveler reveals that it is nontrivial to design 
a robust dynamic controller. 

Unpopular swarms, i.e., swarms with a small peer arrival
rate, impact the MAD time significantly in this experiment.
Unpopular swarms require higher bandwidth than popular
swarms to achieve the same download time (Figure 2). Due to
the Zipf popularity distribution, a majority of swarms for this
workload are unpopular. PropSplit incurs the highest MAD
times because it assigns the least bandwidth to the most
unpopular swarm, which significantly increases the download
time of that swarm. EqualSplit, unlike PropSplit, assigns equal
bandwidth to all the swarms and hence has a much smaller
MAD time. CheatSheet performs the same as EqualSplit
because the unpopular swarms in the workload have nearly
the same performance in both cases. Due to a large number
of unpopular swarms, CheatSheet only assigns 5 KBps more
bandwidth to each unpopular swarm than EqualSplit, which
does not sufficiently impact the MAD times.

Fig. 8. Server bandwidth set by controllers. AIAD fails to meet the target
for $\lambda = 0.12$/s as it drastically reduces server bandwidth near 3000 s.

Does EqualSplit achieve the lowest MAD time for all
workloads? The answer is no. On a workload dominated by popular
swarms, EqualSplit results in 50% higher MAD time than
CheatSheet (refer to tech report [18] for this experiment). This
experiment illustrates that the performance of static controllers
such as EqualSplit can vary depending on the workload.

C. Target download time 

Next, we compare CheatSheet and AIAD on the
MIN_COST objective. We do not compare against the simplistic
static schemes as they are designed to always use all available
capacity (and can therefore be made to appear arbitrarily
worse by choosing a sufficiently low target download time in
an experiment). Our workload for this experiment consisted
of six swarms with peer arrival rates of 0.5/s, 0.14/s, 0.12/s,
0.1/s, 0.08/s, and 0.01/s. All swarms distributed a file of size
10 MB. The target download time for all the swarms is set
to 150 sec. We only present detailed results for arrival rates
0.5/s, 0.12/s, and 0.01/s here. Results for other arrival rates are
qualitatively consistent and are omitted due to lack of space.

Figure 8 shows the average download time achieved by each
strategy over the duration of the experiment (figures in the
left column) and the corresponding server capacity set
by the controllers over the same duration (figures in the right
column). The actual bandwidth consumed at the server is very
close to the configured capacity shown in the figure.

CheatSheet meets the target download time well in all
cases, but AIAD sometimes significantly exceeds the target
download time, as in the later part of Figure 8(c). This
is because AIAD is not always able to accurately estimate
the relation between server bandwidth and the download time.
To illustrate this point, Figure 9 shows the measured download
rate of the swarm and the server bandwidth limit set by AIAD
during this experiment. At t = 2200 s, the measured download rate
of the swarm is above the rate corresponding to the target
download time (10 MB / 150 s = 67 KBps). Hence AIAD decreases
the server bandwidth to 40 KBps at t = 2200 s and then to 30 KBps
at t = 2400 s. This causes the measured download rate to
drop sharply, which is reflected in the increased download time
of peers in Figure 8(c). The download time curve shows an
increase somewhat later as it is calculated as an average over
a window of 2000 s.

We also experimented by changing the interval after which 
AIAD updates bandwidth to 300 sec, but it continues to 
fluctuate above the target download time. Of course, if the 
bandwidth update interval is increased to a sufficiently high 
value and the bandwidth increments/decrements made small, 
the AIAD controller will converge to the target download rate. 
However, it will take longer to converge and will be less 
responsive if peer arrival rates change. 

CheatSheet consumes much less bandwidth compared to
AIAD for $\lambda = 0.5$/s, especially in the first 2000 seconds of the
experiment. While AIAD takes several cycles of measurement
and perturbations to reach the bandwidth allocation,
CheatSheet directly jumps to the minimal required bandwidth using
its model.

D. Summary and discussion 

In summary, our evaluation shows that bandwidth allocation 
done by static controllers is hit-or-miss. A static controller that 
works well for one objective and workload combination may 
perform poorly for others. This is intuitively unsurprising and 
is also consistent with the findings in prior work analyzing 
a specific optimization objective \%\. For a fixed performance 
objective, however, the simplicity of static controllers may
outweigh their sub-optimality (e.g., EqualSplit for the MIN_MAX
metric or PropSplit for the MIN_AVG metric).

Our evaluation also shows that designing a dynamic
controller for scenarios involving peer arrivals and departures is
nontrivial. Although dynamic controllers are generally supe- 
rior to any given simplistic static scheme when evaluated over 
a range of objectives and workloads, we find that they are 
far from optimal. Indeed, in some scenarios, simple schemes 
like EqualSplit or PropSplit outperform dynamic control. The 
reason is that measuring the relationship between swarm 
performance and allocated bandwidth in an online manner 
is nontrivial. As a result of measurement errors, a dynamic 
control scheme is vulnerable to prolonged convergence delays 
or persistent fluctuations. 

The experiments in this paper suggest that a model-based 
approach is feasible and promising. We find that when a 
model-based controller is given a cheat sheet based on prior 
measurements in the regime of interest, it consistently out- 
performs both static and dynamic controllers for different 
objectives and workloads. 

Nevertheless, having gone through the experience of making 
a model-based controller work, our conclusion is that, in its 
current form, the complexity of the model-based approach 
outweighs its advantages. The extensive set of measurements
required to build the model reduces the viability of this
approach. Further, the challenges in estimating model parameters
such as peer arrival rates and upload capacity distribution
of peers (see Section III-B5) can reduce the effectiveness of
the model-based approach.

VI. Related work 

Our primary contribution is a comparative analysis of
different categories of bandwidth controllers for client-assisted
content delivery systems and the design and implementation
of a model-based control approach that, to our knowledge, has
not been attempted before. Our work builds upon a large body
of prior work that can be grouped into dynamic controllers,
models of swarm behavior, and incentive strategies.

Dynamic controllers: AntFarm [8] and V-Formation [9] are
closely related to our work. However, both these works adopt a
dynamic controller approach. While AntFarm monitors the
average download rate of each peer, V-Formation uses more
detailed measurements by monitoring propagation of each block
through the swarm. Our comparison of controller strategies
does not include V-Formation because its implementation is
proprietary and not available publicly. Dynamic controllers
have also been studied for live-streaming P2P systems [23].

Models of swarm behavior: Qiu [11], Fan [12], and Liao
[13] analytically model BitTorrent to derive expressions for
average download time and other swarm metrics. But their
models make assumptions that over-simplify swarm behavior,
e.g., homogeneous upload capacities [11], fixed number of
peers [13], and seeds contributing their full upload capacity
[11], [12]. To address these concerns, we model swarm
performance based on actual swarm measurements.

Incentive strategies: Several BitTorrent clients that
improve BitTorrent's incentive strategies have been proposed,
such as BitTyrant [16], Levin et al.'s client [24], and FairTorrent
[25]. Other swarming systems (incompatible with BitTorrent)
incentivize peers to contribute bandwidth through virtual
currencies, e.g., Dandelion [26], or tokens, e.g., AntFarm [8]. Our
position is that a large majority of users use BitTorrent clients
as-is or use unmodifiable closed-source clients, e.g., Akamai's
NetSession [3], so incentive issues are less important.

VII. Conclusions 
In this paper, we performed a comparative evaluation of 
strategies to control server bandwidth in client-assisted content 
delivery systems. As part of this effort, we introduced a new 
approach referred to as model-based control and presented 
the design and implementation of a model-based controller, 
CheatSheet, that uses a concise model based on a priori offline 
measurement of swarm performance as a function of the server 
bandwidth and other swarm parameters. Our experiments show 
that simple static strategies are unreliable as they perform well 
on some workloads and objectives but fare poorly on others. 
Dynamic control can also lead to sub-optimal performance as
it is prone to prolonged convergence delays and persistent
fluctuations. In comparison, a model-based approach consistently
outperforms both static and dynamic approaches provided 
it has access to detailed measurements in the regime of 
interest. Nevertheless, the broad applicability of a model-based 
approach may be limited in practice because of the overhead 
of developing and maintaining a comprehensive measurement- 
based model of swarm performance in each regime of interest. 